fix: file size mismatch when download interrupted #176
Thanks for the fix!
Problem
When downloading large files (e.g., a 117GB photon-db), connection failures cause the entire download to restart from scratch instead of resuming, because the download state is saved only every 1MB.
Why it happens:
When the connection drops mid-download, the file on disk contains more data than the last saved state (state is persisted only every `save_interval = 1024 * 1024` bytes), so on retry the exact-size check fails and the download restarts from zero.
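A minimal sketch of the failing check described above; the function and variable names are illustrative, not the actual downloader code:

```python
import os

SAVE_INTERVAL = 1024 * 1024  # state persisted only every 1 MB


def can_resume(path: str, saved_bytes: int) -> bool:
    """Old behavior: resume only if the on-disk size exactly matches
    the last saved state. After a mid-chunk disconnect the file is
    slightly larger than the saved state, so this returns False and
    the whole download restarts from scratch."""
    return os.path.getsize(path) == saved_bytes
```

With up to `SAVE_INTERVAL - 1` unsaved bytes appended after the last periodic save, the equality can fail even though every byte on disk is valid data.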
Fix
This fix ensures that progress is saved even when data was written between periodic saves (every 1MB), preventing the loss of up to 1MB of progress.
Additionally, it increases the connect timeout to 30s and the read timeout to 60s, and adds backoff between retries to make downloads more reliable on slow or unstable networks.
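A rough sketch of the fixed behavior, assuming a requests-based downloader; the function names, state file format, and retry limit are illustrative assumptions, not the PR's actual code:

```python
import os
import time
import requests

SAVE_INTERVAL = 1024 * 1024           # persist state every 1 MB
CONNECT_TIMEOUT, READ_TIMEOUT = 30, 60
MAX_RETRIES = 5


def save_state(state_path: str, downloaded: int) -> None:
    # Record how many bytes have been safely written to disk.
    with open(state_path, "w") as f:
        f.write(str(downloaded))


def download(url: str, dest: str, state_path: str) -> None:
    downloaded = os.path.getsize(dest) if os.path.exists(dest) else 0
    for attempt in range(MAX_RETRIES):
        try:
            # Resume from the current file size via an HTTP Range request.
            headers = {"Range": f"bytes={downloaded}-"} if downloaded else {}
            with requests.get(url, headers=headers, stream=True,
                              timeout=(CONNECT_TIMEOUT, READ_TIMEOUT)) as r:
                r.raise_for_status()
                last_saved = downloaded
                with open(dest, "ab") as f:
                    for chunk in r.iter_content(chunk_size=64 * 1024):
                        f.write(chunk)
                        downloaded += len(chunk)
                        if downloaded - last_saved >= SAVE_INTERVAL:
                            save_state(state_path, downloaded)
                            last_saved = downloaded
            save_state(state_path, downloaded)
            return
        except requests.RequestException:
            # Key part of the fix: persist progress on failure too, so the
            # bytes written since the last periodic save are not lost.
            save_state(state_path, downloaded)
            time.sleep(min(2 ** attempt, 60))  # exponential backoff
    raise RuntimeError("download failed after retries")
```

Saving state in the exception handler means a retry can always trust the on-disk size, and the backoff avoids hammering a flaky connection with immediate reconnects.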
Test
I built a new container and ran tests on my slowest setup.
After a network interruption, the Python script continued downloading the file.
It may still be worth increasing the read timeout further (e.g., from `timeout=(30, 60)` to `timeout=(30, 300)`), but at least the download now continues instead of restarting from the beginning.

Fixes #169